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Abstract In this paper we develop random block coordinate gradient descent 
methods for minimizing large scale linearly constrained separable convex prob¬ 
lems over networks. Since we have coupled constraints in the problem, we devise 
an algorithm that updates in parallel r > 2 (block) components per iteration. 
Moreover, for this method the computations can be performed in a distributed 
fashion according to the structure of the network. However, its complexity per 
iteration is usually cheaper than of the full gradient method when the number of 
nodes N in the network is large. We prove that for this method we obtain in expec¬ 
tation an e-accurate solution in at most 0{-^) iterations and thus the convergence 
rate depends linearly on the number of (block) components t to be updated. For 
strongly convex functions the new method converges linearly. We also focus on 
how to choose the probabilities to make the randomized algorithm to converge 
as fast as possible and we arrive at solving a sparse SDP. Finally, we describe 
several applications that fit in our framework, in particular the convex feasibility 
problem. Numerically, we show that the parallel coordinate descent method with 
r > 2 accelerates on its basic counterpart corresponding to r = 2. 


1 Introduction 

The performance of a network composed of interconnected subsystems can be im¬ 
proved if the traditionally separated subsystems are optimized together. Recently, 
coordinate descent methods have emerged as a powerful tool for solving large data 
network problems: e.g. resource allocation Ema, coordination in multi-agent sys¬ 
tems ElCZiIIgl, estimation in sensor networks or distributed control m, image 
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processing [3i mil9| and other areas l6lfT3lfra. The problems we consider in this pa¬ 
per have the following features: the size of data is big so that usual methods based 
on whole gradient computations are prohibitive. Moreover the incomplete structure 
of information (e.g. the data are distributed over the nodes of the network, so that 
at a given time we need to work only with the data available then) may also be 
an obstacle for whole gradient computations. In this case, an appropriate way to 
approach these problems is through coordinate descent methods. These methods 
were among the hrst optimization methods studied in literature but until recently 
they haven’t received much attention. 

The main differences in all variants of coordinate descent methods consist in the 
criterion of choosing at each iteration the coordinate over which we minimize the 
objective function and the complexity of this choice. Two classical criteria used 
often in these algorithms are the cyclic and the greedy coordinate descent search, 
which signihcantly differs by the amount of computations required to choose the 
appropriate index. For cyclic coordinate search estimates on the rate of conver¬ 
gence were given recently in [2] , while for the greedy coordinate search (e.g. Gauss- 
Southwell rule) the convergence rate is given e.g. in |16| . One paper related to our 
work is [1] , where a 2-coordinate greedy descent method is developed for minimiz¬ 
ing a smooth function subject to a single linear equality constraint and additional 
bound constraints on the decision variables. Another interesting approach is based 
on random choice rule, where the coordinate search is random. Recent complex¬ 
ity results on random coordinate descent methods for smooth convex objective 
functions were obtained in [Z1[T2]. The extension to composite convex objective 
functions was given e.g. in These methods are inherently serial. Recently, 

parallel and distributed implementations of coordinate descent methods were also 
analyzed e.g. in |6ll^ [T0l[T5] . 

Contributions: In this paper we develop random block coordinate gradient descent 
methods suited for large optimization problems in networks where the information 
cannot be gather centrally, but rather it is distributed over the network. Moreover, 
in our paper we focus on optimization problems with linearly coupled constraints 
(i.e. the constraint set is coupled). Due to the coupling in the constraints we 
introduce a r > 2 block variant of random coordinate gradient descent method, 
that involves at each iteration the closed form solution of an optimization problem 
only with respect to t block variables while keeping all the other variables fixed. 
Our approach allows us to analyze in the same framework several methods: full 
gradient, serial random coordinate descent and any parallel random coordinate 
descent method in between. For this method we obtain for the expected values of 
the objective function a convergence rate where k is the iteration counter 

and N is the number of nodes in the network. Thus, the theoretical speedup in 
terms of the number of iterations needed to approximately solve the problem, as 
compared to the basic method corresponding to t = 2, is an expression depending 
on the number of components t to be updated (processors) and for a complete 
network the speedup is equal to t (number of components updated). This result 
also shows that the speedup achieved by our method on the class of separable 
problems with coupling constraints is the same as for separable problems without 
coupling constraints. For strongly convex functions we prove that the new method 
converges linearly. We also focus on how to choose the probabilities to make the 
randomized algorithm to converge as fast as possible and we arrive at solving 
sparse SDPs. While the most obvious benefit of randomization is that it can lead 
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to faster algorithms, either in worst case complexity analysis and/or numerical 
implementation, there are also other benefits of our algorithm that are at least as 
important: e.g., the use of randomization leads to a simpler algorithm that is easier 
to analyze, produces a more robust output and can often be organized to exploit 
modern computational architectures (e.g distributed and parallel computers). 
Contents: The paper is organized as follows. In Section [5] we introduce our opti¬ 
mization model and assumptions. In Section [3] we propose a random block coor¬ 
dinate descent algorithm and derive the convergence rate in expectation. Section 
|T] provides means to choose optimally the probability distribution. In Section [5] 
we discuss possible applications and we conclude with some preliminary numerical 
results in Section (6] 

Notation: We work in the space composed by column vectors. For x,y & 
denote the standard Euclidian inner product {x, y) = x"^y and the Euclidian 
norm ||a;|| = {x, x)^^^. For symmetric matrices X, Y we consider the inner product 
(X, Y) = trace(Xy). We use the same notation (•, •) and || ■ || for spaces of different 
dimension. We define the partition of the identity matrix: /jv = [ei • • ■ ejv], where 
Ci G . Then, for any x G we write x = CiXi. Moreover, denotes the 
diagonal matrix with the entries x on the diagonal and x~^ = [x^^ ■ ■ ■ We 

denote with e G the A^—dimensional vector with all entries equal to one. For 
a positive semidefinite matrix W G we consider the following order on its 

eigenvalues 0 < Ai < • • • < Ajv and ||a; 11^ = x^Wx. 


2 Problem formulation 

We consider large data network optimization problems where each agent in the 
network is associated with a local variable so that their sum is fixed and we need 
to minimize a separable convex objective function: 

/* = min /i{a;i) H-h /jv(a:jv) 

s.t.: xi -h • • • -h aJjv = 0. 

For convenience, we will focus on scalar convex functions /i : K —>■ K, i.e. Xi G M, 
in the optimization model O- However, our results can be easily extended to the 
block case, when Xi G M", with n > 1, using the Kronecker product. Moreover, 
constraints of the form aixi -|- • • • -|- ajvSjv = b, where Xi G R” and oti G K, can 
be easily handled in our framework by a change of coordinates. 

Optimization problems with linearly coupled constraints m arise in many areas 
such as resource allocation [3IIZI, coordination in multi-agent systems [SllZlIIH], 
image processing [SlIU llOlfT^ and other areas [iiiiiis]. For problem (HD we asso¬ 
ciate a network composed of several nodes [A^] = {I,-- - , A'^} that can exchange 
information according to a communication graph Q = ([A'^],!?), where E denotes 
the set of edges, i.e. (j,j) G E C [A'^j x [N] models that node j sends information 
to node i. We assume that the graph Q is undirected and connected. For an integer 
r > 2, we also define with Vt the set of paths of r vertices in the graph. Note 
that we have at most (j^) paths of r vertices in a graph. The local information 
structure imposed by the graph Q should be considered as part of the problem 
formulation. 
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Our goal is to devise a distributed algorithm that iteratively solves the convex 
problem o by passing the estimate of the optimizer only between neighboring 
nodes along paths of r vertices. There is great interest in designing such distributed 
and parallel algorithms, since centralized algorithms scale poorly with the number 
of nodes and are less resilient to failure of the central node. We use the notation: 

x=[xi---Xn]'^ and f{x) = fi{xi)^ -l-/jv(a;jv)- 

Let us define the extended subspace S C and its orthogonal complement 
T C 

S' = < a: : a:i = 0 > , T = {u : ui = ■ ■ ■ = un}- 

The basic assumption considered in this paper is: 

Assumption 2.1 We assume that each function ft is convex and has Lipschitz 
continuous gradient with constants Li > 0 , i.e. the following inequality holds: 

\\yMx^)-yMy^)\\<L4x^-y4 Vx.,2/. gM. (2) 

From the Lipschitz property of the gradient m, the following inequality holds for 
all Xi,di G K [TT] : 

fz{xi + dO < fi{xi) + {V fi{xi),di) + ^\\di\f. (3) 

We denote with X* the set of optimal solutions for problem ©• Note that x* is 
optimal solution for o if and only if: 

N 

y]a:*=0, V fi{x*) = V fj{x*) Vi/jG[fV]. 

i=l 


3 Random coordinate descent algorithms 

In this section we devise randomized block coordinate gradient descent algorithms 
for solving the separable convex problem © and analyze their convergence. Since 
we have coupled constraints in the problem, the algorithm has to update in parallel 
T > 2 components per iteration. Usually, the algorithm can be accelerated by 
parallelization, i.e. by using more than one pair of coordinates per iteration. Our 
approach allows us to analyze in the same framework several methods: full gradient 
(t = TV), serial random coordinate descent (r = 2) and any parallel random 
coordinate descent method in between (< 2t < TV). Let us fix TV > t > 2 and we 
denote with AT" G Pr a path of r vertices in the connected undirected graph Q. We 
also assume available a probability distribution pjq- over the set Vt of paths of t 
vertices in the graph Q. Then, we can derive a randomized t coordinate descent 
algorithm where we update at each iteration only r coordinates in the vector x. 
Let us define J\f = (ii,---ir) G Vt, with ii G [TV], sjy = G 

Ljif = [Lij • • • G and V/v = [V/i^ • • • G Under assumption 

m the following inequality holds: 

fix+ ^ + hlsArWo^^. 

ieX 


( 4 ) 
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Based on the inequality @ we can devise a general randomized t coordinate 
descent algorithm for problem dH. let us call it RCDt ■ Given an x in the feasible 
set S, we choose the coordinate r-tuple M EVt with probability pj^. Let the next 
iterate be chosen as follows: 


= a: + ^ adi, 


ieAf 


i.e. we update t components in the vector x, where the direction d_\f is determined 
by requiring that the next iterate x^ to be also feasible for (HD and minimizing 
the right hand side in (|1|), i.e.: 

1 2 

dAt = arg min + + 


or explicitly, in closed form: 


di — 


1 E,6A^7^(V/a(^a)-V^(^0) 

T- 


Mi e M. 


In conclusion, we obtain the following randomized r coordinate gradient descent 
method: 

Algorithm RCD-r 

1. choose r-tuple A4 = (ii, • • • v) with probability pA/" 

2. set = Xi Mi ^ Mk and = xi -\- dll Mi & Nk- 

Clearly, algorithm RCDt is distributed since only neighboring nodes along a path 
in the graph need to communicate at each iteration. Further, at each iteration 
only r components of x are updated, so that our method has low complexity per 
iteration. Finally, in our algorithm we maintain feasibility at each iteration, i.e. 
x\ x% = 0 for all A: > 0. The random choice of coordinates makes the 

algorithm adequate for parallel and distributed implementations and thus more 
flexible than greedy coordinate descent methods [DIIS]- In particular, for t = 2, 
we choose one pair of connected nodes {ik,jk) £ E with some given probability 
Pikjk and obtain the following basic iteration: 

^k+i ^ __A_(ei^ _ ejJ (vfj^ix'lJ - . 

Note that our algorithm belongs to the class of center-free methods (in m the 
term center-free refers to the absence of a coordinator) with the following iteration: 

x'l^^ =Xi + YI, iyfjix’j) - ^Mxi)'j Mi e [A], (5) 

jeAT 


with appropriate weights wlj. Based on the inequality o and the optimality 
conditions for the subproblem corresponding to dj^, the following decrease in the 
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objective function values can be derived: 


< fix) - 

ieM 


T- i^fji^j) - ^f^ix^))\ 




L 


^)2 


= fix) - -V/(a:)^G^V/(a:), 


where the matrix Gj\f is defined as follows: 

G^ = Dll- ^ ( 6 ) 

where, with an abuse of notation, Lj\f G denotes the vector with components 
zero outside the index set A/" and components Li for i G A/". Therefore, taking the 
expectation over the random r-tuple M G Vt, we obtain the following inequality: 

E[fix+) I a:] < fix) - ^Vfix)^GrVfix), (7) 


where Gt = 'Yf.Me'P interpreted as a weighted Laplacian for 

the graph Q. From the decrease in the objective function values given above, 
it follows immediately that the matrix Gt is positive semidefinite and has an 
eigenvalue Ai (Gr) = 0 with the corresponding eigenvector e G T. Since the graph is 
connected, it also follows that the eigenvalue Ai(Gr) = 0 is simple, i.e. A2(Gr) > 0. 
On the extended subspace S we now define a norm that will be used subsequently 
for measuring distances in this subspace. We define the primal “norm” induced by 
the positive semidefinite matrix Gt as: 


||u||g^ = \/ u^GtU Vm G . 


Note that ||w||Gt = 0 for all u G T and ||u||gt > 0 for all u G K" \ T. On the 
subspace S we introduce its extended dual norm: 


II^^IIg = max (x,u) Va; G S'. 


Using the definition of conjugate norms, the Cauchy-Schwartz inequality holds: 
(u,a:) < ||m||g 2 • lkllG 2 Va: G S, u G 

Let us define the average value: u = ^ 12^=1 Then, the dual norm can be 
computed for any x £ S as follows: 

II^^IIg = max {x,u) = max {x,u — ue) 

^ {GtU,u)‘^ 1 u\{GT{LL — ue),u — ue)‘^l 

= max = max 

u:{GtU,u)<1,Y^^i u:{GTU,u)<l,e'^u=0 

= max (x, u) 

u:{GrU,u)< l,u'^ ee'^ u<0 

= min max [(a;, u) +/iff — (GtU, u)) — i^(ee^u. It)] 

= min a((aGr + J^ee )~ = minmin[/i H— ((Gt-\ -ee )~ x.x)] 

u>0 fi>0 fl M 

= mn \J ((Gt + C,ee^)~^x,x). 
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In conclusion, we obtain an extended dual norm that is well defined on subspace S: 

||2:||g^ = mn {{Gt + Cee'^)”^ x, x) Va; G S'. (8) 

Using the eigenvalue decomposition of the positive semidefinite matrix Gt = 
S'diag(0, A 2 , • • • , Ajv)S'^, where Ai are its positive eigenvalues and S = [e ^2 • • • ^n] 
such that (e, ^i) = 0 for all i, then: 

{Gr + = “'diag(C||e||^, A 2 , ■ • • , Ajv)“^S'^. 

From dHl) it follows immediately that our defined norm has the following closed 
form expression: 

Ikllo^ = Vx'^Gtx Mx e S, (9) 

where Gt = S'diag(0, • • • , denotes the pseudoinverse of the matrix Gt- 


3.1 Convergence rate: smooth case 


In order to estimate the rate of convergence of our algorithm in the smooth case 
fAssumDtion l2.1|) we introduce the following distance that takes into account that 
our algorithm is a descent method: 


n{x°) 


max min llx — x* 

{x^S\f{x)<f{x^)} x*^X* 


GtI 


which measures the size of the level set of / given by We assume that this 
distance is finite for the initial iterate After k iterations of the algorithm, we 
generate a random output {x^, f{x'^)), which depends on the observed implemen¬ 
tation of random variable: 

= (A/o, • • • ,A4). 

Let us define the expected value of the objective function w.r.t. 77 ^: 

= U [fix'^)] . 


We now prove the main result of this section, i.e. sublinear convergence in mean 
for the smooth convex case: 


Theorem 3.1 Let Assumvtion \2. 1\ hold for the optimization problem © and the 
sequence {x^)k>o be generated by algorithm RGDt- Then, we have the following 
sublinear rate of convergence for the expected values of the objective function: 


4>k 


r < 


2Ty{x°) 


k 


( 10 ) 
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Proof Recall that all our iterates are feasible, i.e. G S. From convexity of / and 
the definition of the norm || • on the subspace S, we get: 

fix) - r < (V/(:r‘),x‘ - X*) < llo;' - x*||&J|V/{x‘)||g. 
<n{x°)-\\Vf{x^)\\G^ VI >0. 

Combining this inequality with we obtain: 

or equivalently 

Taking the expectation of both sides of this inequality in rji-i and denoting Ai = 
(j)i — f* leads to: 

Ai+i <Ai- 27^2 (^0)- 

Dividing both sides of this inequality with AiAi+i and taking into account that 
Ai+i < Ai (see ©), we obtain: 

A'l - A^^ ~ 2TZHx0) 

Adding these inequalities from I = 0, - ■ ■ ,k—l we get that 0 < -^ < — 2 ti‘^(x'^) 

from which we obtain the statement (1101) of the theorem. □ 

Theorem 13.II shows that for smooth convex problem m algorithm RCDt has sub- 
linear rate of convergence in expectation but with a low complexity per iteration. 
More specifically, the complexity per iteration is 0(rn/ -|- t), where nf is the 
maximum cost of computing the gradient of each function fi and t is the cost of 
updating x'^. We assume that the cost of choosing randomly a r-tuple of indices 
M for a given probability distribution is negligible (e.g. for r = 2 the cost is In N). 


3.2 Convergence rate: strongly convex case 

Additionally to the assumption of Lipschitz continuous gradient for each function 
fi (see Assumption (EU). we now assume that the function / is also strongly 
convex with respect to the extended norm || • ||g^ with convexity parameter ao^ 
on the subspace S. More precisely, the objective function / satisfies for all x,y £ S: 

fix) > fiy) + iyfiy),x -y) + ^ (||® - • (H) 

We now derive linear convergence estimates for algorithm RCDr under the addi¬ 
tional strong convexity assumption: 
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Theorem 3.2 Let Assumption (EH) hold and additionally we assume f to be also 
cTG^-strongly convex function with respect to norm || • ||q^ (see (HID ). Then, for 
the sequence (ai^)fc>o generated by algorithm RCDt we have the following linear 
estimate for the convergence rate in expectation: 

4>k-r <{i-aG^t[f{x°)-r). (12) 

Proof From © we have: 

2 (/(x'') - E [f{x'^+^) I x'']) > ||V/(x'=)||^^. 

On the other hand, minimizing both sides of inequality dm over X £ S we have: 

||v/(t/)||^^ >2aG.(/{y)-r) ^yes 
and for y = x^ we get: 

\\Vf{x^)fG^>2aG^ (^fix'^)-ry 

Combining the first inequality with the last one, and taking expectation in pk-i 
in both sides, we prove the statement of the theorem. □ 

We notice that if fi ’s are strongly convex functions with respect to the Euclidian 
norm, with convexity parameter ai, i.e.: 

Mxi)> f^{yi) + {Vfi{yi),x^-y^) + Y\x^-yi\‘^ '^Xi,yi, z G [fV], 

then the whole function / = fi is also strongly convex w.r.t. the extended 
norm induced by the positive definite matrix Dcr, where a = [a\ ■ ■ ■ crjv]^, i.e.: 

fix) > f{y) + {Vfiy),x- y) + \\\x- Va:,y. 

Note that in this extended norm || • \\d„ the strongly convex parameter of the 
function / is equal to 1. It follows immediately that the function / is also strongly 
convex with respect to the norm || • ||g^ with the strongly convex parameter 
satisfying: 

gg^D^ a Gt + C,ee , 

for some ^ > 0. In conclusion, the strong convexity parameter gg,. needs to satisfy 
the following LMI: 


ggJn < oV^iGr + (ee^)Dl/^. ( 13 ) 

Finally, we should notice that we can also easily derive results showing that the 
problem is approximately solved with high probability in both situations, smooth 
and/or strongly convex case, see e.g. [7] for details. 
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3.3 How the number of updated blocks t enters into the convergence rates 

Note that matrix Gt depends directly on the number of components r to be up¬ 
dated and therefore, R{x°) is also depending on t. Moreover, the convergence rate 
can be explicitly expressed in terms of r for some specihc choices for probabilities 
and for the graph Q. In particular, let us assume a complete graph Q and that we 
know some constants Ri > 0 such that for any x satisfying f{x) < /(x°) there 
exists an x* G X* such that: 

\xi — Xi\ < Ri Vi G [N], 

and recall that L = [Li ■ ■ ■ Ljv]^ and Dl is the diagonal matrix with entries on the 
diagonal given by the vector L. Moreover, let us consider probabilities depending 
on the Lipschitz constants Li for any path M £ Vt of t vertices in the complete 
graph Q, defined as: 


L 

PM = 




'^Mev^ '^ieM \/L{ 


(14) 


Theorem 3.3 Under Assum,vtion \2. 1\ and for the choice of probabilities m on 
a complete graph Q, the following sublinear convergence rate for algorithm RCDr 
in the expected values of the objective function is obtained: 


(pk- f* < 


N-l 2Y,,L^Rf 


T — 1 k 

Proof Using the dehnition of the indicator function l^y, we can see that: 


1=1*=i 


Mev-r- ieM 1=1 ieMi 

N / , , 

= Ei E'v,w|=E 

1=1 yi=i 


1 / T - 1 

T,\n-i 


Thus, using Gt = J^MeV PxGu and the expression of Gm given in (O, we can 
derive that matrix Gt has the following expression: 


Gt = — r y 


Y.T.^Z-LM\L-^y 




MeVr lieM 

(;) 


E. 


j=i 

1 /t- 2 


leMj 


Xp^\N-2 


T — 1 

N-l 


D7^- 


.i=l ^ 

1 
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Using the previous expression for Gt in ( 0 , we get: 

(lla^llcj^ = X! -^*1**1^ Va:GS. (15) 

i6[JV] 

Using the definition for Ri and the expression (US for the norm on S, we obtain: 
TZ^{x°) < TZ^{R) = 

ie[N] 

where R = [i?i ■ ■ ■ Using this expression for TZ{x^) in Theorem 13.II we get 

the statement of the theorem. □ 

Note that for t = N we recover the convergence rate of the full gradient method, 
while for t = 2 we get the convergence rate of the basic random coordinate descent 
method. Thus, the theoretical speedup of the parallel algorithm RCDr in terms of 
the number of iterations needed to approximately solve the problem, as compared 
to the basic random coordinate descent method, is equal in this case to t - the 
number of components to be updated (number of processors available). This result 
also shows that the speedup achieved by our method on the class of separable 
problems with coupling constraints is the same as for separable problems without 
coupling constraints. 

For the strongly convex case, we note that combining the Lipschitz inequality ([3]) 
with the strong convex inequality m we get: 

L^\xi - y^f > CTG, (Ik - vWhy Vs, yeS. 

^e[N] 

Now, if we consider e.g. a complete graph and the probabilities given in m, then 
using the expression for the norm || • ||g^ given in m we obtain ao^ < n-\ • 
Thus, in the strongly convex case the linear convergence rate in expectation (HD 
can be also expressed in terms of r. 


4 Design of optimal probabilities 

We have several choices for the probabilities (pa/')a/'6Rx corresponding to paths 
of T vertices in the complete graph Q, which the randomized coordinate descent 
algorithm RCDr depends on. For example, we can choose probabilities dependent 
on the Lipschitz constants Li\ 

^ AfeVr ieAf 

Note that for a = 0 we recover the uniform probabilities. Finally, we can design 
optimal probabilities from the convergence rate of the method. From the definition 
of the constants Ri it follows that: 

R-ix^) < TZiR) = max IkHc i with Gr = PAfGjif. 

xeS,\xi\<R, " 
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We have the freedom to choose the matrix Gt that depends linearly on the prob¬ 
abilities pjv'- For the probabilities {pM)N'e'P^ corresponding to paths of r vertices 
in the complete graph Q we dehne the following set of matrices: 


= Gr: G. = ^ pmGm, Gm = Dll - sr \/r ■ 

I A/-6P. ] 

Therefore, we search for the probabilities pj^ that are the optimal solntion of the 
following optimization problem: 

72,*(a:°) = min72,(a:°) < min7?,(i?) = min max ||a;||G ■ 

PV VN Gr^M X^S,\Xi\<Ri ’’ 


Let US define R = [i?i • • • i?Ar]^ and u = [ui - ■ ■ In the next theorem we derive 

an easily computed upper bound on TZ*{xo) and we provide a way to suboptimally 
select the probabilities pj^: 

Theorem 4.1 Let Assumption \27J\ hold. Then, a suboptimal choice of probabili¬ 
ties {pAf)j^eVT be obtained as a solution of the following SDP problem whose 
optimal value is an upper bound on TZ*{xo), i.e.: 


(TZ*(xo)) < min 
^ ' GreMX>o 


,iy>0 { 


{r,RI : 


Gt + Cee'^ I]\f 
In Di, 




(17) 


Proof Using the definition of TZ{R) and of the norm || • we get: 

min(7?.(i?))^ < min max (||a;||G 
PV ^ GtEM xeS,||a;hl<-Ri ^ " 

= min max min/fGr + Cee^) ^x.x) 

GtEM a;eS,||a;i||<fii C>0 

= min max ((Gt + Cee^)~^a:, a:) 

GTeMX>0 a;6S.||a;i||<fli 

= min max ((Gt + Cee^)~^, 

GtGMX^O a:6S.||a;i||<fld 

= min max((GT + Cee^)~^,X), 

Gt<emx>o xex'^ 

where X = {X : X ^ 0, rankX = 1, {ee^,X) = 0, {X,Eii) < Vi} and 
Eii = CicJ . Using the well-known relaxation from the SDP literature, we have: 


min(7?,(i?))^ < min max ((Gt + Cee“^) ,X), 

PJ\f G X E ; C ^ 0 ^ 


where AV = {X : X y 0, (ee^,X) = 0, {X,Eii) < R^ Vi}, i.e. we have removed 
the rank constraint: rankX = 1. Then, the right hand side of the previous opti¬ 
mization problem can be reformulated equivalently, using Lagrange multipliers, as 
follows: 


min max ((Gt + Cee^) ^,X) 

GT6At,C>o xexp' 

min max [((Gt + Cee^)”^^0ee^, X) 

GTeMXw>o,z'^o,oev. L 


N 

+ J2r.{R^-{X,Eu))]. 

i=l 
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where v = [i>\ - ■ ■ Rearranging the terms, we can write the previous convex 

problem equivalently: 

min ViRi 

GxgA1,0eR,ZtO,C,i^>O 

i 

+ max ((Gr + Cee^)“^ + Z + 0ee^ - ViEii,X)\ 

i 

= min I'iRi, 

{G^,z,c,i',0)eT^ 

I 

where the feasible set is described as: E = {(Gt, Z, v, 9) : Gr G Xi, 9 ^ R, Z y 
> 0, (Gr + Cee^)~^ + Z + 9ee'^ — = O}- Moreover, since Z y 0, the 

feasible set can be rewritten as: 

{(Gr,C,i^,^) : Gr eM,9eR,(,iy> “ (Gr + Cee'^)~^ - 9ee'^ t O}. 


We observe that we can take 0 = 0 and then we get the feasible set: 

{ (Gr, C, 0') ■ Gr G M, (^,V > 0, Gr + (^66 ^ D,^ } . 

In conclusion, we obtain the following SDP: 

min(7?.(-R))^ < min 

pv GreM,c,u>o,Gr+Cee'^hD;:^ 

Finally, the SDP (I17II is obtained from Schur complement formula applied to the 
previous optimization problem. □ 


Since we assume a connected graph Q, we have that Ai(Gr) = 0 is simple and 
consequently \ 2 (Gr) > 0. Then, the following equivalence holds: 

ee^ 

Gr + t .. ^ IIn if and only if t < \ 2 (Gr), (18) 

ll®ll 

since the spectrum of the matrix Gr + Cee^ is {Cl|e||^, A2(Gr), • • • ,Aiv(Gr)}. It 
follows that C = liTp, ~ 1 b such that t < \ 2 (Gr) is feasible for 

the SDP problem (I17II . We conclude that: 



< 


min 

GrgA1,C,!^>0,Gr + Cee’’tDy^ 

^ 1 

min y^ R? - < 

Gr6M,t<A2(Gr) “ t 
1=1 


{u,R^) 

A2(Gr) 


VGr G M. 


(19) 


Then, according to Theorem l3.1l we obtain the following upper bound on the rate 
of convergence for the expected values of the objective function in the smooth 
convex case: 




r < 


A2(Gr) • k 


VGr G M. 


( 20 ) 
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From the convergence rate for algorithm RCDt given in II20I) it follows that we 
can choose the probabilities such that we maximize the second eigenvalue of Gt '■ 

max \ 2 (Gt)- 

Gr^M 

In conclusion, in order to hnd some suboptimal probabilities (pa/')a/'6T’x ; 
solve the following simpler SDP problem than the one given in CZI): 

Corollary 4.1 From dm) we get for the smooth ease that a suboptimal choice of 
probabilities {pM)N'e'P^ can be obtained as a solution of the following SDP problem: 

p*^ = arg ^ max^ |t : G. ^ t . (21) 

Note that the matrices on both sides of the LMI from m have the common 
eigenvalue zero associated to the eigenvector e, so that this LMI has empty interior 
which can cause problems for some classes of interior point methods. We can 
overcome this problem by replacing the LMI constraint in (I21II with the following 
equivalent LMI: 

+ PP - * ■ pp) • 

Finally, when the functions fi are (Ti-strongly convex, from Theorem 13.21 and the 
LMI (1131) it follows that in order to get a better convergence rate we need to search 
for (JG^ as large as possible. Therefore, we get the following result: 

Corollary 4.2 For the strongly convex case the optimal probabilities are chosen 
as the solution of the following SDP problem: 

pIj-= a.rg max {o-g, ■ ctgJn ^ D^^^{Gr + Cee'^)Dl^H . (22) 

In |17| . the authors propose a (center-free) distributed scaled gradient method in 
the form dS]) to solve the separable optimization problem m with strongly convex 
objective function, where at each iteration the full gradient needs to be computed. 
A similar rate of convergence is obtained as in Theorem 13.21 under the Lipschitz 
and strong convexity assumption on /, where the weights are designed by solving 
an SDP in the form (EH). Our randomized algorithm also belongs to this class of 
methods and for t = N we recover a version of the method in m- Moreover, 
our convergence analysis covers the smooth case, i.e. without the strong convexity 
assumption. 


5 Applications 

Problem m arises in many real applications, e.g. image processing [3j|31[T9], re¬ 
source allocation diia and coordination in multi-agent systems [71IT8]. For exam¬ 
ple, we can interpret © as N agents exchanging n goods to minimize a total cost, 
where the constraint = 0 is the equilibrium or market clearing constraint. 

In this context [xi]j > 0 means that agent i receives [xi]j of good j from exchange 
and [xi\j < 0 means that agent i contributes \ {xi)j\ of good j to exchange. It can 
be also viewed as the distributed dynamic energy management problem: N devices 
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exchange power in time periods t = 1, ■ ■ ■ ,n. Furthermore, Xi G M” is the power 
flow profile for device i and fi{xi) is the cost of profile Xi (and usually encodes 
constraints). In this application the constraint ~ 0 represents the energy 

balance (in each time period). 

Problem © can also be seen as the dual corresponding to an optimization of a sum 
of convex functions. Consider the following primal convex optimization problem 
that arises in many engineering applications: 

g* = min gi{v)-\ - \-gN{v), (23) 


where gt are all Ui-strongly convex functions and Qi C R" are convex sets. De¬ 
note with V* the unique optimal solution of problem (1231) . This problem can be 
reformulated as: 


min 

Ui^Qi,Ui=v 


w +-h5iv(wiv). 


Let us define u = [ui ■ ■ ■ and g{u) = gi{ui) + - ■ • +gjv(uv). By duality, using 

the Lagrange multipliers Xi for the constraints m = v, we obtain the equivalent 
convex problem o, where fi{xi) = gt{xi) and g* is the convex conjugate of the 
function gi = gi + 1q. , i.e. 


fi{xi) = max {xi,Ui) - gi{ui) Vi. 

UiEQi 


(24) 


Further we have f* +g* = 0. Note that if gt is cri-strongly convex, then the convex 
conjugate fi is well-defined and has Lipschitz continuous gradient with constants 
L, = ^ (see mi so that Assumption 12.11 holds. A particular application is the 
problem of finding the projection of a point vo in the intersection of the convex 
sets nfLiQi C K". This problem can be written as an optimization problem in the 
form: 

min pi||r> - noll^-I-+Piv||r'- volP, 

venfLiQi 

where pi > 0 such that pi = 1. This is a particular case of the separable problem 
([231). Note that since the functions gi(v) = pi\\v — r:o||^ are strongly convex, then 
fi have Lipschitz continuous gradient with constants Li = l/pi for all i G [A^]. 

We now show how we can recover an approximate primal solution for the primal 
problem (1231) by solving the corresponding dual problem (|T]) with algorithm RCDr- 
Let us define for any dual variable Xi the primal variable: 

Ui{xi) = arg min g^{ui) - {xi,Ui) Vf G [N], 

UiEQi 

Let us define v* = e0v*, where 0 is the Kronecker product, and the norm ||u||£ 5 _^ = 
Moreover, let a^iin = mim Ui and Ajv the largest eigenvalue of Gt- 
Furthermore, for simplicity of the presentation we consider the initial starting 
point = 0. Then, we can derive convergence estimates on primal infeasibility 
and suboptimality for (1231) . 
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Theorem 5.1 For the convex optimization problem (1231) we assume that all func¬ 
tions gi are oi-strongly convex. Let x^ be the sequence generated by algorithm 
RCDr for solving the corresponding dual problem ([T|) and the primal sequence 
= u{x^). Then, we have the following convergence estimates in the expected 
values on primal infeasibility and suboptimality: 




and E\\giu^)-g*\\<^-^^. 
^ ^ O-minVfc 


Proof Since all the functions gi are ai-strongly convex, then the objective function 
Y^^—igi{ui) — {xi,Ui) is also 1 -strongly convex in the variable u w.r.t. the norm 
II^IId, = J2i o'illwilP- Using this property and the expression of Ui{xi), we obtain 
the following inequalities: 


1 , 


N 


i{x) - v*\\d_^ = Y^\ui{xi) - 

i=l 

- ( - {Xz,v*) I - I Y 9i{Ui{Xi)) - {Xi,Ui{Xi)) 

= i-n - i-fix)) = fix) -r ^xe s. 


(25) 


Now, let us consider for x the sequence x^ generated by algorithm RCDr and let 
We note that = u’f for all i G [N] \ A4 and = Ui{x’^'^^) for 
all i G A/fc. Taking expectation over the entire history 77 ^ and using Theorem 13.11 
we get an estimate on primal infeasibility: 


E[\\u'^ - v*\Y] < m/ix^) - /*]=2(<Afc - n < 


4R^{x°) 

k 


Moreover, for deriving estimates on primal suboptimality, we first observe: 

Aiv 


MOr < 


IWWOr VU, 


and combining with (1251) we get: 


lix) - i)*||G^ < 


AivV2(/(a;) - /*) 


(26) 


For the left hand side suboptimality, we proceed as follows: 

fix*) = {x*,v*) - giv*) 

= max {x*,u) - giu) > {x*,uix)) - giuix)), 

Ui^Qi 

which leads to the following relation: 

giuix)) - g* > {x*,uix) - v*) > -||x*||gJ|i)* - w(a;)||G., 


J26ll 

> -lla:* 


IG,- 


Aivv/2(/(cc)-/*) 


Vs G S. 


( 27 ) 


Secondly, from the definition of the dual function, we have: 

fix) = {x, m(s)) - giuix)) Mx G S. 
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Subtracting /* = f{x*) from both sides and using the complementarity condition 
{x*,u{x*)) = 0, where u{x*) = v*, we get the following relations: 


g{u{x)) - g* = {x,u{x)) - f{x) + f* 

= f{x*) — f{x) + {x — x* ,u{x*)) + {x,u{x) — u{x*)) 

= f{x*) + {x- x*,\/f{x*)) - f{x) + {x,u{x) - u{x*)) 

< {x,u{x) - u(x*)} < IIxIIgJIu* - u{x)\\g^ 

< (llx - x*\\*G, + ||a;*||Gj ||u(a;) - t}*||G, 

^ m *11* , II *11* \ >^N^/‘2{f{x) - /*) 

< (Ik-a; IIg,+ l|a; IIgJ-—-, 


valid for all a: G S' and x* £ X*, where in the first inequality we used convexity 
of the function / and the relation X f(x) = u{x), and in the second inequality 
the Cauchy-Schwartz inequality. Now, using the definition of R{x^) and that x^ = 
0, replacing x with the sequence x^ in the previous derivations and taking the 
expectation over the entire history rj^, we obtain a bound on primal suboptimality: 


E[\g{u^)-g*\]<E 


/ii k *11* , II *11* \ AivV2(/(a:'=) - /*) 

(||x -a; IIg,+ l|a: IIgJ-- 

' ' CTmin 


min 
2/_0\ 


< 


2R{x )Xn l 2 E[f{x^^) - /*] < 4-^ 

(Trv,i„ ' rr^-.^y/k 


which gives us a convergence estimate for primal suboptimality for problem (12311 . 

□ 


In conclusion, the expected values of the distance between the primal generated 
points Ui G Qi and the unique optimal point v* G HiQi of (1231) . i.e. — u*||], 

is less than Similar convergence rates for other projection algorithms for 

solving convex feasibility problems have been derived in the literature, see e.g. I3llil . 


6 Numerical experiments 


In this section we report some preliminary numerical results on solving the opti¬ 
mization problem where the functions fi are taken as in paper m- 

fi{xi) = ^ai{xi - Ci)'^+\og(l + eyip{bi{xi - di))) Vi G [N], (28) 

where the coefficients at > 0,bi,Ci and di are generated randomly with uniform 
distributions on [—15, 15]. The second derivatives of these functions have the 
following expressions: 


/, {xi) =a^ + 


bf exp(6i(a^i - di)) 


(1 -I-exp(6i(a:i - di)))^’ 
which have the following lower and upper bounds: 

(Ti = at and Li = at + —bi. 
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We assume that the sum of the variables if fixed to zero, i.e.: 

N 

= 0. 

i=l 

In all our numerical tests we consider a complete graph and Lipschitz dependent 
probabilities as in (1141) (if not specified otherwise). 

In the first set of experiments, we solve a single randomly generated problem 
with N = 10^ nodes for r = 2,4 and 7 cores in parallel using MPI. The Fig. 1 
displays the evolution of f[x^) — f* along normalized iterations k/N of algorithm 
RCDt- From the plot we can observe that increasing the number of cores reduces 
substantially the nnmber of full iterations k/N. 


Fig. 1 Typical performance of algorithm RCDt for different numbers of updated variables 
(processors) r: evolution of f{x^) — f* along normalized iterations k/N, with r = 2,4 and 7. 





10 r 


10 


10 


10 " 


*■» 


■T = 2 
■r = 4 
■r = 7 






10 15 20 25 

k/N 


30 


Then, we tested algorithm RCD 2 , i.e. r = 2, and thus at each iteration we choose 
a pair of nodes in the graph {i,j) G E with probability pij and then update only 
the components i and j of x as follows: 


xt = X 


= x^ + 


1 




iVfj{x,)-VMx,)), 


x+=x 


= Xj + 


1 


Li~\~L 


■{Nh{xi)-Vf,{x,)). 


We consider three choices of the probabilities in Fig. 2 for algorithm RCD 2 '. uni¬ 
form probability, probabilities depending on Lipschitz constants as given in m 
and optimal probabilities obtained from solving the SDP (EH). As we expected, the 
method based on choosing the optimal probabilities has the fastest convergence. 

Finally, we compare our algorithm RCD 2 , i.e. r = 2, for two choices for the 
probabilities pij (uniform and Lipschitz dependent probabilities (11411 ) with the 
full gradient method and the center-free gradient method with Metropolis weights 
proposed in m- The global Lipschitz constant in the full gradient is taken as 
Lmax = maxi Li. Note that the computations of the local Lipschitz constant Li 
required by algorithm RCDt can be done locally in each node for the correspond¬ 
ing function fi, while computing the global Lipschitz constant Lmax for projected 
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Fig. 2 Evolution of f{x^) — f* along full iterations for algorithm RCD 2 for different choices 
for the probability: uniform probabilities, Lipschitz dependent probabilities as given in (11411 
and optimal probabilities obtained from solving the SDP problem dHJ. 



Fig. 3 Evolution of f{x^) — f* along full iterations for the methods: projected gradient, 
center-free gradient method with Metropolis weights El and algorithm RCDr with uniform 
and Lipschitz dependent probabilities for r = 2. 



gradient method on problems of very large dimension is difficult. We have also 
implemented the center-free gradient method with Metropolis weights from m- 

xt mjiVfjixj) - VMxr)) Vi G [N], 

jeN'i 


where A/i are the neighbors of node i in the graph and the weights satisfy the 
relation: 3 we plot the evolution of f{x^) — /* along 

full iterations k/N for the following methods: the center-free gradient algorithm 
from m with the Metropolis weights, the full projected gradient algorithm and the 
RCD 2 algorithm with uniform and Lipschitz dependent probabilities. We clearly 
see that the best accuracy is achieved by the RCD 2 algorithm with Lipschitz 
dependent probabilities. 
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7 Conclusions 

In this paper we have derived parallel random coordinate descent methods for min¬ 
imizing linearly constrained convex problems over networks. Since we have coupled 
constraints in the problem, we have devised an algorithm that updates in parallel 
r > 2 (block) components per iteration. We have proved that for this method 
we obtain in expectation an e-accurate solution in at most (!?( — ) iterations and 
thus the convergence rate depends linearly on the number of (block) components 
to be updated. Preliminary numerical results show that the parallel coordinate 
descent method with r > 2 accelerates on its basic counterpart corresponding to 
r = 2. For strongly convex functions the new method converges linearly. We have 
also provided SDP formulations that enable us to choose the probabilities in an 
optimal fashion. 


References 

1. A. Beck, The 2-coordinate descent method for solving double-sided simplex constrained 
minimization problems, Journal of Optimization Theory and Applications, 162(3), 892— 
919, 2014. 

2. A. Beck and L. Tetruashvili, On the convergence of block coordinate descent type methods, 
SIAM Journal on Optimization, 23(4), 2037-2060, 2013. 

3. H. Bauschke, J.M. Borwein, On projection algorithms for solving convex feasibility prob¬ 
lems, SIAM Review, 38(3), 367-426, 1996. 

4. P. L. Combettes, The convex feasibility problem in image recovery, in Advances in Imaging 
and Electron Physics (P. Hawkes Ed.), 95, 155-270, Academic Press, 1996. 

5. H. Ishii, R. Tempo and E. Bai, A web aggregation approach for distributed randomized 
pagerank algorithms, IEEE Transactions Automatic Control, 57, 2703—2717, 2012. 

6. Ji Liu and S. Wright, Asynchronous stochastic coordinate descent: parallelism and con¬ 
vergence properties, SIAM Journal on Optimization, 25(1), 2014. 

7. I. Necoara, Random coordinate descent algorithms for multi-agent convex optimization 
over networks, IEEE Transactions Automatic Control, 58(8), 2001—2012, 2013. 

8. I. Necoara and A. Patrascu, A random coordinate descent algorithm for optimization 
problems with composite objective function and linear coupled constraints, Computational 
Optimization and Applications, 57(2), 307—337, 2014. 

9. I. Necoara and D. Clipici, Parallel coordinate descent methods for composite minimiza¬ 
tion: convergence analysis and error bounds, SIAM Journal on Optimization, 1-29, 2016 
(http://arxiv.org/abs/1312.5302). 

10. I. Necoara and D. Clipici, Efficient parallel coordinate descent algorithm for convex opti¬ 
mization problems with separable constraints: application to distributed MPC, Journal of 
Process Control, 23(3), 243-253, 2013 

11. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Boston, 
Kluwer, 2004. 

12. Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization prob¬ 
lems, SIAM Journal on Optimization, 22(2), 341-362, 2012. 

13. Z. Qin, K. Scheinberg and D. Goldfarb, Efficient block-coordinate descent algorithms for 
the group lasso, Mathematical Programming Computation, 5(2), 143-169, 2013. 

14. P. Richtarik and M. Takac, Iteration complexity of randomized block-coordinate descent 
methods for minimizing a composite function. Mathematical Programming, 144(1-2), 1- 
38, 2014. 

15. P. Richtarik and M. Takac, Parallel coordinate descent methods for big data optimization. 
Mathematical Programming, 1-52, 2015. 

16. P. Tseng and S. Yun, A Block-coordinate gradient descent method for linearly constrained 
nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140, 
513-535, 2009. 

17. L. Xiao and S. Boyd, Optimal scaling of a gradient method for distributed resource allo¬ 
cation, Journal of Optimization Theory and Applications, 129(3), 469-488, 2006. 



Random coordinate descent methods on large optimization problems 


21 


18. K. You and L. Xie, Network topology and communication data rate for consensusability of 
discrete-time multi-aqent systems, IEEE Transactions Automatic Control, 56(10), 2262- 
2275, 2011. 

19. S. Wright, Accelerated block coordinate relaxation for regularized optimization, SIAM Jour¬ 
nal on Optimization, 22(1), 159—186, 2012. 



